In [1]:
import lxml.etree
import os

0. Preparing Data

Before digging into the parser notebook, the version of the CAPEC xml file within this notebook is v2.11, which can be downloaded from this link. Here we loaded CAPEC v21 xml file. Therefore, if there is any new version of XML raw file, please make change for the following code. If the order of attack pattern table is changed, please change the code for function extract_target_field_elements in section 2.1.


In [2]:
capec_xml_file='capec_v2.11.xml'

1. Introduction

The purpose of this notebook is to build the fields parser and extract the contents from various fields in the CAPEC v2.11 so that the field content can be directly analyzed and stored into a database. Guided by CAPEC Introduction notebook, this notebook will focus on the detail structure under attack pattern table and how parser functions work within the attack pattern table.

To preserve the semantic information and not lose details during the transformation from the representation on website and XML file to the output file, we build a 3-step pipeline to modularize the parser functions for various fields in different semantic format. The 3-step pipeline contains the following steps: searching XML Field node location, XML field node parser, and exporting the data structure to the output file based on the semantic format in Section 4 of CAPEC Introduction Notebook. More details will be explained in Section 2.

2. Parser Architecture

The overall parser architecture is constituted by the following three procedures: 1) extracting the nodes with the target field tag, 2) parsing the target field node to the representation in memory, and 3) exporting the data structure to the output file.

Section 2.1 explains the way to search XML field nodes with the target field tag. No matter parsing which field, the first step is to use Xpath and then locate all XML field nodes with the field tag we are intended to parse. The function in section 2.1 has been tested for all fields and thus can locate XML nodes with any given field naming, except Summary under Description node. If parsing Summary, please make change for the Xpath and the code to extract CAPEC-id.

Section 2.2 explains the way to parse and extract the content of the target field into the representation in memory. Since different fields have various nested structures in xml raw file and the content we will parse varies field by field, the worst situation is that there will be one parser function for each different field. However, from Section 4 in CAPEC Introduction Notebook, certain fields may share a same format on website, such as table or bullet list, the ideal situation is that we would have only 4 or 5 functions to represent the data in memory.

Section 3 addresses the way to export the data representation from Section 2.2. A set of functions in Section 3 should be equal to the number of data structures in Section 2.2.

2.1 XML Field Node Location

This function searches the tree for the specified field node provided as input and returns the associated XML node of the field. The string containing the field name can be found in the Introductory Notebook's histogram on Section 4 . As it can be observed in that histogram, only certain fields are worthwhile parsing due to their occurrence frequency.

Since some CAPED-id nodes in capec xml file may have the empty field node that does not contain any information, the function will remove all empty field nodes from the results of Xpath.

In addition, since Summary field is under Description field, please make change for target_field_path when parsing Summary.


In [3]:
def extract_target_field_elements(target_field, capec_xml_file):
    '''
    This function responsibility is to abstract how nodes are found given their field name and should be used together with the histogram.
    
    Args:
        - target_field: the arg defines which nodes are found given by the field that we are aiming to target
        - capec_xml_file: the CAPEC xml file that this function will work and extract the target field nodes
    Outcome:
        - a list of nodes that have the pre-defined target field as the element tag
    '''
    # read xml file and store as the root element in memory
    tree = lxml.etree.parse(capec_xml_file)
    root = tree.getroot()
    
    # Remove namespaces from XML.  
    for elem in root.getiterator(): 
        if not hasattr(elem.tag, 'find'): continue  # (1)
        i = elem.tag.find('}') # Counts the number of characters up to the '}' at the end of the XML namespace within the XML tag
        if i >= 0: 
            elem.tag = elem.tag[i+1:] # Starts the tag a character after the '}'

    # define the path of target field. Here we select all element nodes that the tag is the target field
    # if the target field is Summary field, please change the Xpath and the code to extract capec_id:
    # target_field_path='Attack_Pattern/Description/'+target_field
    target_field_path='Attack_Pattern/./'+target_field

    # extract attack pattern table in the XML
    attack_pattern_table = root[2]
    
    # generate all elements with the target field name
    target_field_nodes=attack_pattern_table.findall(target_field_path)
    
    # remove all empty field s
    for target_field_node in target_field_nodes:
        if type(target_field_node.text)==type(None):
            target_field_nodes.remove(target_field_node)
    
    return target_field_nodes

2.2 XML Node Field Parser

Once the node is provided by the former function, its XML structure is consistent for that field, and CAPEC version, but it is inconsistent for different versions. For example, a table represented in XML is different than a paragraph in XML. However, even for the same expected structure such as a table, the XML may have different tags used within it.

The associated parser function then is tested to cover all possible cases of this field in the XML, and also interpreted against its .html format to understand its purpose. We again refer to the introductory notebook Sections 4 and 5 which convey respectively the potential usage of the field for PERCEIVE and the overall format of the structure when compared to others.

The purpose is then documented as part of the functions documentation (sub-sections of this Section), conveying the rationale behind what was deemed potentially useful in the content to be kept, as well as how certain tags were mapped to a data structure while being removed.

The parser function outputs one of the known data structures (i.e. memory representation) that is shared among the different fields of what is deemed relevant. For example, while 5 field nodes may have their relevant information in different ways, they may all be at the end of the day tables, named lists, or single blocks of texts. Sharing a common representation in this stage decouples the 3rd step of the pipeline from understanding the many ways the same information is stored in the XML.

Because different CAPEC versions may organize a .html table, bullet list etc in different ways even for the same field, this organization also decouples the previous and following section functions from being modified on every new version if necessary.

The following fields have parser functions as of today:

Field Name Function Name
Solutions_and_Mitigations parse_list_format_field
Attack_Prerequisites parse_list_format_field

2.2.1 Parse the fields in list format

Both Solutions and Mitigations and Attack Prerequisites have the same structure under the field element in XML file. As shown in the following images, the only difference is how they are represented on the website: The format of Solutions and Mitigations is unbulleted list with qualified entries, while the format of Attack Prerequisites is unordered list. Therefore, the parsing function should work and output the same representation in memory for both fields.

To understand the nesting structure, here we use the Solutions and Mitigations field for CAPEC-10 as example. Under Solutions_and_Mitigations element, there are four mitigation entries named by 'Solution_or_Mitigation', which represent the actions or approaches to prevent or mitigate the risk of the attack. Under each Solution_or_Mitigation node, there is a text element that directly contains the information that the function will parse.

Since the 'text' tag does not contain any valuable information, we use the dictionary to pair the CAPEC id and the content we parse from the field without any tag. The content will be represented as a list that each element is the content under 'text' node. In summary, the data structure in memory can be represented as the following format: {CAPEC_id: [solution_or_mitigation_1, solution_or_mitigation_2..]}. More details can be found below.

Since the number of elements in the content list depending on the CAPEC_id, here is the cardinality of these tags:

  • For Solutions and Mitigations field
Tag Cardinality
solution_or_mitigation 1 or more
  • For Attack Prerequisites field
Tag Cardinality
attack_prerequisite 1 or more
  • How Solutions and Mitigations represent on the website (CAPEC-10)

  • How Solutions and Mitigations represent in the xml file (CAPEC-10)

  • How Attack Prerequisites represent on the website (CAPEC-10)

  • How Attack Prerequisites represent in the xml file (CAPEC-10)


In [4]:
def parse_list_format_field(field_node):
    '''
    The parser function concern is abstracting how Solutions_and_Mitigations field and Attack_Prerequisites field are stored in XML, 
    and provide it in a common and simpler data structure
    
    Args:
        - field_node: the node that has the Solutions_and_Mitigations tag or Attack_Prerequisites tag, such as the above image
    
    Outcomes: 
        - A dictionary that pairs capec_id as key and the content list as value.  
          In the dictionary, the content list will be a list of strings that contain bullet or unbullet content
          More details can be found in the following example for CAPEC-10.
    ''' 
    
    # extract capec_id from the attribute of its parent node
    capec_id=field_node.getparent().attrib.get('ID')
    capec_id='CAPEC_'+capec_id
    field_tag=field_node.tag
    field_content_list=[]
    # for each field entry node under the target field node
    for field_entry in list(field_node):
        # traverse all entry_element nodes under each field entry
        for entry_text in list(field_entry):
            # generate tag and content of each entry_element
            entry_text_content=entry_text.text
            # if there is not content, then move to next entry element
            if type(entry_text_content)==type(None):
                    continue
            #add each point to the content list
            field_content_list.append(entry_text_content)
    # pair the capec_id with the content list     
    field_dict=dict()
    field_dict[capec_id]=field_content_list
    
    return field_dict

Through Function parse_list_format_field, the above Solutions_and_Mitigations node for CAPEC-10 will be parsed into the following data format in memory.

  • How content represent in memory for Solutions_and_Mitigations field (CAPEC-10)

Through Function parse_list_format_field, the above Attack_Prerequisites node for CAPEC-10 will be parsed into the following data format in memory.

  • How content represent in memory for Attack_Prerequisites field (CAPEC-10)

3. Export Data Structure

At the point this notebook is being created, it is still an open end question on how will tables, bullet lists and other potential structures in CAPEC will be used for topic modeling. For example, should rows in a table be concatenated into a paragraph and used as a document? What if the same word is repeated as to indicate a lifecycle?

In order to keep this notebook future-proof, this section abstracts how we will handle each field representation (e.g. table, bullet list, etc) from the memory data structure. It also keeps it flexible for multi-purpose: A table may be parsed for content for topic modeling, but also for extracting graph relationships (e.g. the Related Attack Pattern and Related Weaknesses fields contain hyperlinks to other CAPEC entries which could be reshaped as a graph).


In [5]:
def export_data(target_field_node):
    '''This section code will be done in the future.'''
    pass

4. Main Execution

After developing the 3 steps parsing pipeline, this section will combine these 3 steps and produce the output file for different fields. As introduced in Section 2, although the parsing procedure keeps same for all fields, each field will have own parsing function, while the same format of fields may share a same exporting function. As a result, the main execution varies for each field.

4.1 Main execution for Solutions and Mitigations

The main execution will combine the above 3 steps parsing pipeline for Solutions_and_Mitigations. After developing function export_data, the following code should produce the output file that contains the parsed content of Solutions_and_Mitigations for all CAPEC_id.


In [11]:
if __name__ == "__main__":
    # extract the nodes, whose tag is Solutions and Mitigation, from capec_xml_file
    solutions_and_mitigations_nodes=extract_target_field_elements('Solutions_and_Mitigations',capec_xml_file)
    # read each Solutions and Mitigations node
    for solutions_and_mitigations_node in solutions_and_mitigations_nodes:
        # parse the content for each solutions_and_mitigation_node
        solutions_and_mitigations_info=parse_list_format_field(solutions_and_mitigations_node)
        # export the parsed content  TO-DO
        export_data(solutions_and_mitigations_info)

4.2 Main execution for Attack Prerequisites

The main execution will combine the above 3 steps parsing pipeline for Attack_Prerequisites. After developing function export_data, the following code should produce the output file that contains the parsed content of Attack_Prerequisites for all CAPEC_id.


In [9]:
if __name__ == "__main__":
    # extract the nodes, whose tag is Attack Prerequisites, from capec_xml_file
    attack_prerequisites_nodes=extract_target_field_elements('Attack_Prerequisites',capec_xml_file)
    # read each Attack_Prerequisites node
    for attack_prerequisites_node in attack_prerequisites_nodes:
        # parse the content for each Attack_Prerequisites node
        attack_prerequisites_info=parse_list_format_field(attack_prerequisites_node)
        # export the parsed content  TO-DO
        export_data(attack_prerequisites_info)